Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

13.2 Botryology

161

Fig. 13.1 Each object is represented by a cross corresponding to its value of the chosen feature on

the real line double struck upper RR. Clusters C1, C2, and C3 are easily identiﬁable. The spot on the line represents a

possible value that could be used to divide the set dichotomously

waking hours (in other words, the comparison of unknown objects with known pro-

totypes), of more current interest in bioinformatics is the unsupervised discernment

of patterns in, for example, gene and genome sequences, especially since the pro-

portion of unknown material in genomes is still overwhelming. ⁵A very powerful

methodology for achieving that is to examine whether the data resulting from some

operations carried out on a DNA sequence (for example) can be arranged in such a

way that structure appears, namely that groups of data points constituting a subset

of the entire dataset are clumped together to form two or more distinct entities.

The clustering process is deﬁned as the partition of a set of objects by some features

into disjoint subsets, and each subset in which objects are united by some features

is called the cluster. If no relation between the objects is known, it is impossible to

construct clusters.

The simplest case of clustering arises when only one feature exists; each object

under consideration either has the feature or does not, in which case the maximum

number of clusters is two and, if it happens that all the objects have that feature, then

there will be only one cluster. Another simple case arises if values from the real-

number line can be attributed to the feature (Fig. 13.1). This is easily generalized to

two or more dimensions, the number of dimensions being equal to the number of

chosen features.

If the set of objects is large and many features have been chosen, it is necessary

to have algorithms for clustering that enable it to be carried out automatically on

the computer. Many such algorithms are known; a few of them are brieﬂy described

below. ⁶It is assumed that there is a set StartSet upper X EndSet{X} of objects (upper X Subscript iXi, etc.) in upper NN-dimensional

feature space. For ease of representation, we will tacitly consider upper N equals 2N = 2.

Hyperspheres. A circle of radius rr is drawn around an arbitrarily chosen object.

Objects within the circle form the ﬁrst subcluster. New circles are now drawn with

their centres at these other objects, which encompass yet more objects, around which

new circles are again drawn, and so forth until no new objects are added. If all of

the objects in the set are now included, the process has failed. If, on the other hand,

objects remain, then one of those remaining objects is arbitrarily chosen and the

process is repeated.

5 We also have the intermediate process of semisupervised learning, which deals with the problem

of combining small amounts of labelled data with large amounts of unlabelled data—the classic

paper is Zhu et al. (2003).

6 See Verulava et al. (2009) for the “rank of links” method.